“Advanced Graphics and Data Visualization in R” is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. Many of the datasets and examples used in this course will be drawn from real-world datasets and the techniques learned herein aim to be broadly applicable to multiple fields.
This lesson is the fifth in a 6-part series. By the end of the series, students should be able to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.
The structure of the class is a code-along style in R markdown notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto datatools Hub so students can program along with the instructor.
This week will focus on exploring the relationships within your data: comparing observations between experimental samples and applying a suite of clustering, dimension reduction, and projection algorithms.
At the end of this lecture you will have covered visualizations related to:
grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept, or an individual file or folder
bold - a heading or a term that is being defined
blue text - a named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn R
Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.
Each week, new lesson files will appear within your RStudio folders.
We are pulling from a GitHub repository using this Repository
git-pull link. Simply click on the link and it will take you to the
University of Toronto datatools
Hub. You will need to use your UTORid credentials to complete the
login process. From there you will find each week’s lecture files in the
directory /2025-03-Adv_Graphics_R/Lecture_XX. You will find
a partially coded skeleton.Rmd file as well as all of the
data files necessary to run the week’s lecture.
Alternatively, you can download the R-Markdown Notebook
(.Rmd) and data files from the RStudio server to your
personal computer if you would like to run independently of the Toronto
tools.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
At the end of each lecture there will be a completed version of the lecture code released as an HTML file under the Modules section of Quercus.
Today’s datasets will focus on the smaller datasets we worked with in earlier lectures (namely our Ontario public health unit COVID-19 demographics data), and a new set of RNAseq analyses on different tissue samples from COVID-19 patients.
This is a combination of datasets from previous lectures. It incorporates PHU demographic data with StatsCan census data from 2017 to produce a normalized estimate of cases, deaths, and hospitalizations across age groups and public health units in Ontario.
This is the same readcount data we looked at in Lecture 04: RNA-Seq read count data generated from SARS-CoV-2 infections of AEC cells, used to compare the timecourse of (pro-inflammatory) expression changes in samples treated with and without HSP90 inhibitors. Published in iScience doi: https://doi.org/10.1016/j.isci.2021.102151
From Desai et al., 2020 on medRxiv doi: https://doi.org/10.1101/2020.07.30.20165241 this dataset has normalized expression counts from RNAseq data. It covers multiple samples and tissues from COVID-positive patients with a focus on lung tissue. The expression data has also been left unfiltered for SARS-CoV-2 expression.
From Desai et al., 2020 on medRxiv doi: https://doi.org/10.1101/2020.07.30.20165241 this dataset contains patient information like viral load from tissues that were used for RNAseq.
tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2
viridis helps to create colour-blind palettes for our data visualizations
RColorBrewer has some helpful palettes that we’ll need to colour our data
gplots will be used to help generate heatmap/dendrogram visualizations
FactoMineR and factoextra will be used for PCA generation and visualization
Rtsne and umap are packages implementing non-linear projection algorithms
# Some packages can be installed via Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ComplexHeatmap")
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
## CRAN: https://cran.r-project.org
## Bioconductor version 3.12 (BiocManager 1.30.22), R 4.0.5 (2021-03-31)
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'ComplexHeatmap'
## Installation paths not writeable, unable to update packages
## path: C:/Program Files/R/R-4.0.5/library
## packages:
## boot, class, cluster, codetools, crayon, evaluate, foreign, KernSmooth,
## lattice, mgcv, nlme, nnet, pbdZMQ, rlang, rpart, spatial, survival
install.packages("FactoMineR", dependencies = TRUE)
## Installing package into 'C:/Users/mokca/AppData/Local/R/win-library/4.0'
## (as 'lib' is unspecified)
## Warning: dependency 'emmeans' is not available
## also installing the dependencies 'FactoInvestigate', 'missMDA', 'Factoshiny'
##
## There are binary versions available but the source versions are later:
## binary source needs_compilation
## FactoInvestigate 1.7 1.9 FALSE
## missMDA 1.18 1.19 FALSE
## Factoshiny 2.4 2.7 FALSE
## FactoMineR 2.4 2.11 TRUE
## installing the source packages 'FactoInvestigate', 'missMDA', 'Factoshiny', 'FactoMineR'
## Warning in install.packages("FactoMineR", dependencies = TRUE): installation of
## package 'FactoMineR' had non-zero exit status
## Warning in install.packages("FactoMineR", dependencies = TRUE): installation of
## package 'FactoInvestigate' had non-zero exit status
## Warning in install.packages("FactoMineR", dependencies = TRUE): installation of
## package 'missMDA' had non-zero exit status
## Warning in install.packages("FactoMineR", dependencies = TRUE): installation of
## package 'Factoshiny' had non-zero exit status
install.packages("factoextra", dependencies = TRUE)
## Installing package into 'C:/Users/mokca/AppData/Local/R/win-library/4.0'
## (as 'lib' is unspecified)
## Warning: dependency 'emmeans' is not available
## also installing the dependencies 'FactoMineR', 'igraph'
##
## There are binary versions available but the source versions are later:
## binary source needs_compilation
## FactoMineR 2.4 2.11 TRUE
## igraph 1.3.0 2.1.4 TRUE
##
## package 'factoextra' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\mokca\AppData\Local\Temp\Rtmp0Qwv86\downloaded_packages
## installing the source packages 'FactoMineR', 'igraph'
## Warning in install.packages("factoextra", dependencies = TRUE): installation of
## package 'FactoMineR' had non-zero exit status
## Warning in install.packages("factoextra", dependencies = TRUE): installation of
## package 'igraph' had non-zero exit status
install.packages("Rtsne")
## Installing package into 'C:/Users/mokca/AppData/Local/R/win-library/4.0'
## (as 'lib' is unspecified)
##
## There is a binary version available but the source version is later:
## binary source needs_compilation
## Rtsne 0.16 0.17 TRUE
## installing the source package 'Rtsne'
## Warning in install.packages("Rtsne"): installation of package 'Rtsne' had
## non-zero exit status
install.packages("umap")
## Installing package into 'C:/Users/mokca/AppData/Local/R/win-library/4.0'
## (as 'lib' is unspecified)
##
## There is a binary version available but the source version is later:
## binary source needs_compilation
## umap 0.2.8.0 0.2.10.0 TRUE
## installing the source package 'umap'
## Warning in install.packages("umap"): installation of package 'umap' had
## non-zero exit status
# ------------ Legacy installation code which we hopefully never need again ----------#
# We need to specifically install a package called pbkrtest for the facto packages
# This is due to using an older version of R
# library(remotes)
# install_version("pbkrtest", "0.5.1")
# Packages to help tidy our data
library(tidyverse)
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.3     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.0
## v ggplot2   3.4.4     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()   masks stats::filter()
## x purrr::is_empty() masks git2r::is_empty()
## x dplyr::lag()      masks stats::lag()
## x dplyr::pull()     masks git2r::pull()
## x purrr::when()     masks git2r::when()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(magrittr)
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
##
## The following object is masked from 'package:git2r':
##
## add
# Packages for the graphical analysis section
library(viridis)
## Loading required package: viridisLite
# library(gplots) # heatmap2()
library(ComplexHeatmap) # Heatmap()
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.6.2
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
##
## If you use it in published research, please cite:
## Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
## genomic data. Bioinformatics 2016.
##
## This message can be suppressed by:
## suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
library(RColorBrewer)
# Useful for PCA and PCA visualization
library(FactoMineR)
## Error in library(FactoMineR): there is no package called 'FactoMineR'
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Data projection packages
library(Rtsne)
library(umap)
Last week we looked at an analysis of RNAseq data through a number of methods starting with broad-level volcano plots and moving towards gene-level expression visualizations with dotplots. In between we stopped to take a look at heatmaps. In this instance we simply used heatmaps to convey expression levels of multiple genes across one or more samples.
Looking more broadly, we now wish to ask questions such as “how similar are our samples?”, “can our samples be grouped or categorized in some way?” and “is there an underlying structure or architecture to our data?” We briefly discussed scatterplot matrices to help analyse our sample quality among replicate experiments.
We can study these questions using a number of techniques that range from clustering/categorization to projection of high-dimensional data onto a lower set of dimensions (usually 2 or 3).
We cannot begin our journey until we talk about the nature of the data we are interested in examining. Usually, our data will consist of many samples (observations) with some number of features (variables) captured about each observation. For example, with RNAseq data we could consider the measurements we capture for each gene as a separate feature/variable/column.
Conversely, you may have hundreds or thousands of samples you’d like to categorize in some way (or show that you can categorize) with just a smaller set of features. For every nut, there is a tool of sorts for cracking it!
We’ll start with a COVID dataset collected by Ontario. It consists of a collection of demographic case data from across 34 public health units (PHUs) from January 2020 to March 2022. Observations are broken down by PHU and age group. The demographics data has been modified to include a set of normalized values (individuals per 100k) that was calculated based on StatsCan 2017 census data. Modeling off of these data, the 2022 sizes for each age group were estimated. Case, death, and hospitalization counts were normalized based on these subgroup sizes to generate values per 100K individuals. We’ll be working with this 2022 dataset mainly because it utilizes some fine-grained binning of age-groups and is an appropriate size for first investigating the idea of clustering data.
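As a quick illustration of the per-100K normalization described above (using hypothetical numbers, not the actual PHU or census values):

```r
# Hypothetical counts for one age group in one PHU (illustrative values only)
cases      <- 181    # cumulative case count
population <- 5481   # estimated population of that age group

# Normalize to cases per 100,000 individuals
cases_per_100k <- cases / population * 1e5
round(cases_per_100k)
```

The same arithmetic, applied per age group within each PHU, produces the `*_per_100k` columns we’ll work with below.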
The dataset will be found in
phu_demographics_census_norm.csv and we’ll use it to guide
us through two sections of today’s lecture.
Let’s begin by loading the data and taking a look at its structure.
# Import the normalized demographics data
covid_demographics_norm.df = read_csv(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Look at its structure
str(covid_demographics_norm.df, give.attr = FALSE)
## Error in str(covid_demographics_norm.df, give.attr = FALSE): object 'covid_demographics_norm.df' not found
Let’s begin looking at our covid_demographics_norm.df.
Using a series of data sets, we’ve created a consolidated dataset
with:
Cumulative cases, deaths, and hospitalizations due to COVID-19 within each age group per PHU.
Representation of each age group within each data category as a percent of total incidents in each PHU.
Using 2017 census data, the number of cases per 100,000 individuals normalized by estimated population size for each age group within each PHU.
The question we want to answer is: of the 34 public health units, which look most similar based on the normalized case data for each age group? In order to visualize this data, we’ll want to convert our current long-form data that looks something like this:
| public_health_unit | age_group | total_cases | … | cases_per_100k | deaths_per_100k | hospitalizations_per_100k |
|---|---|---|---|---|---|---|
| Algoma | 0 to 4 | 181 | … | 3302 | 0 | 164 |
| Algoma | 12 to 19 | 412 | … | 4464 | 0 | 54 |
| … | … | … | … | … | … | … |
Into something like this:
| public_health_unit | population_2022 | category | 0 to 4 | 5 to 11 | 12 to 19 | 20 to 39 | 40 to 59 | 60 to 79 | 80+ |
|---|---|---|---|---|---|---|---|---|---|
| Algoma | 117840 | cases | 3302 | 4464 | 7570 | 4301 | 4890 | 2331 | 4116 |
| … | … | … | … | … | … | … | … | … | … |
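Before doing that with the real data, the same long-to-wide reshaping can be sketched on a tiny made-up data frame (all names and values hypothetical):

```r
library(dplyr)
library(tidyr)
library(stringr)

# A miniature long-format stand-in for the PHU data
toy.df <- tibble(
  public_health_unit = c("A", "A", "B", "B"),
  age_group          = c("0 to 4", "5 to 11", "0 to 4", "5 to 11"),
  cases_per_100k     = c(100, 200, 300, 400),
  deaths_per_100k    = c(1, 2, 3, 4)
)

toy_wide.df <- toy.df %>%
  # Gather the *_per_100k columns into category/value pairs
  pivot_longer(cols = ends_with("_per_100k"),
               names_to = "category", values_to = "value") %>%
  # Strip the "_per_100k" suffix from the category labels
  mutate(category = str_remove(category, "_per_100k")) %>%
  # Spread the age groups out into columns
  pivot_wider(names_from = age_group, values_from = value)

toy_wide.df   # one row per PHU/category pair, one column per age group
```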
We need to do the following to the dataset:
# Create a wide-format version of our normalized data
covid_demographics_norm_wide.df <-
# Start by passing along the long-form normalized data
covid_demographics_norm.df %>%
# Select for just the PHU, age_group, population size and cases/hosp/deaths_per_100k
dplyr::select(c(3,4,11,13:15)) %>%
# Pivot the data to a longer format
pivot_longer(cols = -c(1:3), names_to = ..., values_to = ...) %>%
# Get rid of the suffix of "_per_100k"
mutate(category = ...(string = category, pattern = "_per_100k")) %>%
# Pivot the age_group/per_100k data out wider
pivot_wider(names_from = ...,
values_from = ...
) %>%
# Move the "5 to 11" category to after "0 to 4". You could use a longer "select" call to do this too!
relocate(., `5 to 11`, .after=`0 to 4`)
## Error in relocate(., `5 to 11`, .after = `0 to 4`): '...' used in an incorrect context
# Take a look at our resulting dataframe
head(covid_demographics_norm_wide.df)
## Error in head(covid_demographics_norm_wide.df): object 'covid_demographics_norm_wide.df' not found
At this point, we would normally be prepared to visualize our data
with ggplot but our data is in wide-format
and we’ll be using a package that prefers to work with
matrix data. In that case, we need to strip the data
down further because matrix data must be all of the same type and we
want to work with numeric data! We’ll use as.matrix() for
our conversion.
Filter for a single category of data and then drop the category variable from it.
Move the public_health_unit information over to the row names of our matrix so we can still track our data properly.
We’ll also lose our population_2022 column of data, but we’ll have to remember that it’s there.
# Now we need to make our matrix and assign row names.
# It's kind of awkward to need this requirement.
# Cast our matrix and drop the first column
covid_demographics_norm.mx <- covid_demographics_norm_wide.df %>%
# Filter for just case data
dplyr::filter(...) %>%
# Drop the PHU and category data
dplyr::select(-c(public_health_unit, category)) %>%
# Convert to a matrix
as.matrix()
## Error in dplyr::select(., -c(public_health_unit, category)): '...' used in an incorrect context
# Set the row names using the information from the data frame
rownames(covid_demographics_norm.mx) <-
covid_demographics_norm_wide.df %>%
# Pull the PHU names
... %>%
# Create a unique vector of them
...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Take a peek at our data
head(covid_demographics_norm.mx)
## Error in head(covid_demographics_norm.mx): object 'covid_demographics_norm.mx' not found
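The data-frame-to-matrix pattern itself is small enough to show on a toy example (hypothetical column names and values):

```r
library(tibble)

# A small data frame: one identifier column plus numeric columns
toy.df <- tibble(id = c("r1", "r2"),
                 a  = c(1, 2),
                 b  = c(3, 4))

# Drop the identifier column and cast the rest to a numeric matrix
toy.mx <- as.matrix(toy.df[, -1])

# Carry the identifiers over as row names so we can still track the rows
rownames(toy.mx) <- toy.df$id

toy.mx
```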
Heatmap()

From the ComplexHeatmap package we can use the function Heatmap() to generate a heatmap that can do a little more than what we were doing with our ggplot2 version. A nice aspect of using Heatmap() is that we can ask it to reorder our samples along both the rows and columns. At the same time, we can present the relationships between column elements or row elements in the form of dendrograms.
In its current form, we have ordered our data along the columns in an ascending ordinal arrangement. While this makes sense to us now, it may not help to visually identify the strongest trends or groupings in our data. Clustering attempts to bring order to our data by grouping it according to specific algorithms.
In general, single points are grouped together as neighbours, with the “nearest” neighbours determined by a metric of some kind. These clusters are then further grouped using the same metrics until the entire set is presented in these ordered groups. These groupings/relationships can also be presented as dendrograms. In our current case, we aren’t concerned with determining groups per se, but rather with connecting them by their similarity and then creating a hierarchical order.
Our data is usually coded in n-dimensional space, depending on the
nature of our dataset. For covid_demographics_norm.mx our 7
columns are coded by data from 34 PHUs meaning each column is coded in
34 dimensions or 34 features. Conversely, our 34 rows represent PHUs
each coded by data across 7 age groups and therefore 7 dimensions.
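A minimal sketch of that hierarchical grouping, using base R’s dist() and hclust() on a small random matrix (not the course data):

```r
set.seed(42)

# 6 observations (rows), each coded in 4 dimensions (columns)
m <- matrix(rnorm(24), nrow = 6,
            dimnames = list(paste0("obs", 1:6), paste0("feat", 1:4)))

# Pairwise Euclidean distances between the rows...
d <- dist(m, method = "euclidean")

# ...then agglomerative clustering to build the hierarchy
hc <- hclust(d, method = "complete")

hc$order      # the leaf ordering used to draw a dendrogram
# plot(hc)    # would draw the dendrogram itself
```

The distance metric and agglomeration method are both swappable; Euclidean distance with complete linkage is simply a common default.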
Important or helpful parameters to run Heatmap():
matrix: our matrix object. It must be a matrix, and not a data frame. It could also be a vector, but that becomes a single column.
col: determines the colours used for the image (nothing to do with column names).
name: the name/title of the heatmap (also used as the legend title by default).
row_*: set our row text properties including titles and labels: row_title, row_title_side, row_title_gp (graphic properties), row_names_side, row_names_gp.
column_*: the equivalent properties for columns.
cluster_rows and cluster_columns: logical parameters to determine if clustering should occur (default is TRUE).
show_heatmap_legend: whether or not to show the heatmap legend.
heatmap_legend_param: a list of legend settings; the title specifically is set with list(title = "x").
show_[row/col]_dend: whether or not to show dendrograms for the rows/columns.

There are actually a lot of options but these should help us make a basic one. Let’s plot our data with and without a dendrogram to compare.
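Put together on a small random matrix, a minimal call might look like the following (a sketch assuming ComplexHeatmap is loaded; the matrix, names, and titles are made up):

```r
library(ComplexHeatmap)

set.seed(1)
m <- matrix(rnorm(35), nrow = 7,
            dimnames = list(paste0("row", 1:7), paste0("col", 1:5)))

demo_hmap <- Heatmap(m,
                     name = "value",            # used as the legend title
                     column_title = "A minimal Heatmap() example",
                     cluster_rows = TRUE,       # cluster and reorder the rows
                     cluster_columns = FALSE,   # keep the column order as supplied
                     show_row_dend = TRUE)      # show the row dendrogram

# Render the Heatmap object
ComplexHeatmap::draw(demo_hmap)
```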
?Heatmap
## starting httpd help server ... done
# Create a Heatmap object
cases_hmap <-
# Supply our matrix minus the populations size
Heatmap(...,
cluster_rows = ..., cluster_columns = ..., # Don't cluster on either rows or columns
# Use column_title as the title of our heatmap
column_title = "Heatmap of COVID-19 cases in PHUs by age group: unclustered",
# Rotate the legend horizontally and give it a title
heatmap_legend_param = list(title = "cases per 100K individuals",
legend_direction = "horizontal"),
# Rotate column names to horizontal
column_names_rot = 0,
column_names_center = TRUE
)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Plot the heatmap
...(cases_hmap,
# Plot the legend on the bottom
heatmap_legend_side = "bottom"
)
## Error in ...(cases_hmap, heatmap_legend_side = "bottom"): could not find function "..."
# Create a Heatmap object
cases_hmap <-
# Supply our matrix minus the populations size
Heatmap(covid_demographics_norm.mx[,-1],
cluster_rows = ..., cluster_columns = ..., # Cluster on both rows and columns
# Use column_title as the title of our heatmap
column_title = "Heatmap of COVID-19 cases in PHUs by age group: clustered",
# Rotate the legend horizontally and give it a title
heatmap_legend_param = list(title = "cases per 100K individuals",
legend_direction = "horizontal"),
# Rotate column names to horizontal
column_names_rot = 0,
column_names_center = TRUE
)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Plot the heatmap
ComplexHeatmap::draw(cases_hmap,
# Plot the legend on the bottom
heatmap_legend_side = "bottom"
)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'draw': object 'cases_hmap' not found
HeatmapList object

Our heatmap above is drawn from a single dataset category - cases - but it helps us to see that there are some strong signals that differentiate between our different PHUs. This is a 34x7 grid where we can investigate all of the data in our dataset. What happens when we want to produce this kind of visualization from our other two metrics, hospitalizations and deaths?
To accomplish this, we can repeat our steps and create 3 separate
Heatmap objects but we can plot
them together as a single complex heatmap. In fact, rather than
store multiple heatmap objects, we’ll create a HeatmapList
object by concatenating together multiple Heatmap objects
using a for loop.
We’ll recreate our code from above but simplify the heatmap code a little. While we’re at it, we’ll also convert our colour scheme to viridis.
# Create a list of heatmap objects from our demographic data
hm_list <- NULL
# Create some quick vectors to help ourselves out
categories <- c(..., "hospitalizations", "deaths")
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Store the rownames we want to use on our matrices
mx_rownames <- covid_demographics_norm_wide.df %>% pull(public_health_unit) %>% unique()
## Error in pull(., public_health_unit): object 'covid_demographics_norm_wide.df' not found
# Use a loop to separate the data by category
for (i in categories) {
# Create a temporary matrix of the specific category data
temp_mx <- covid_demographics_norm_wide.df %>%
dplyr::filter(category == i) %>%
dplyr::select(-c(public_health_unit, category)) %>%
as.matrix()
# Set the rownames
rownames(temp_mx) <- mx_rownames
# Create a Heatmap object
hmap <- Heatmap(temp_mx[,-1], # Supply our matrix minus the populations size
# Use a viridis colourscale broken into 100 segments
col = viridis(100),
# Use column_title as the title of our heatmap
column_title = i,
# Rotate the legend horizontally and give it a title based on category
heatmap_legend_param = list(title = paste0(i, " per 100K individuals"),
legend_direction = "horizontal")
)
# Add this to our heatmap list
hm_list <- ...
}
## Error in eval(expr, envir, enclos): object 'categories' not found
# What kind of object is hm_list?
class(hm_list)
## [1] "NULL"
draw() your HeatmapList

Now that we have our HeatmapList we can simply draw the whole thing and it will automatically concatenate for us. Of course there are many options we can use to display the list. One thing to note is that the clustering of columns in each heatmap is independent, while the clustering of rows (and thus order) is based on only the first heatmap (by default).
Using the ComplexHeatmap::draw() method to display your
HeatmapList will allow you to further customize details of
the overall figure including setting row_title and
column_title for the entire figure.
ComplexHeatmap::draw(hm_list,
column_title = "COVID-19 metrics breakdown by age group and PHU\n",
column_title_gp = gpar(fontsize = 20) # gpar() sets graphical parameters settings
)
## Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'draw' for signature '"NULL"'
Our heatmap above is drawn from a relatively simple dataset but it helps us to see that there are some strong signals that differentiate between our different PHUs. It becomes clear that while the bulk of cases in each PHU comes from the 20-39 age segment, the bulk of hospitalizations and deaths originates from the 80+ segment.
Of course, this data was cumulative information from January 2020 to March 2022. In the aftermath of vaccinations and different variants, a look at the current demographics may reveal a shift in these trends.
Considering that the heatmap is a 34x7 grid, it is easier to visualize and dissect all of the data in our dataset. What happens, however, when we want to produce this kind of visualization from something much larger or more complex?
Recall from last lecture that we investigated read count data from
Wyler et al., 2020 -
Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv. We used this
dataset to produce a scatterplot matrix to compare between some of the
replicate data within the set. Across this set there are 36 sets of
RNA-Seq data spanning 27,011 genes.
Let’s begin by opening up the data file and getting a quick reminder of what it looks like.
# Read in your read_count data
wyler_readcounts.df <- read_tsv(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
str(wyler_readcounts.df, give.attr = FALSE)
## Error in str(wyler_readcounts.df, give.attr = FALSE): object 'wyler_readcounts.df' not found
As you might recall from last week, there is quite a bit of data in this set. So that we can more easily visualize our data, we’ll want to wrangle it a bit more by:
Why do we filter our read counts? In this instance we have chosen to filter out both very low and very high read counts. Normally, wild swings in expression are exactly what we’re looking for, but for in-class analysis, to save a bit on memory and time, we limit the read counts to a narrow band to reduce our set size. It will also greatly reduce the size of our heatmap for viewing. By filtering, we’ll go from 27,011 observations to 109.
wyler_readcounts_filtered.df <-
wyler_readcounts.df %>%
# Rename the columns by removing the first portion: AECII_xx
rename_with(., ~ str_replace(string = .x,
pattern = r"(\w*_\d*_)",
replace = "")) %>%
# filter out the low-readcount data
filter(if_all(.cols = -1, .fns = ~ .x > ... & .x < ...))
## Error in rename_with(., ~str_replace(string = .x, pattern = "\\w*_\\d*_", : object 'wyler_readcounts.df' not found
# Create our data matrix
wyler_readcounts.mx <- as.matrix(wyler_readcounts_filtered.df[, ...])
## Error in as.matrix(wyler_readcounts_filtered.df[, ...]): object 'wyler_readcounts_filtered.df' not found
# Set the row names using the information from the data frame
rownames(wyler_readcounts.mx) <- wyler_readcounts_filtered.df$gene
## Error in eval(expr, envir, enclos): object 'wyler_readcounts_filtered.df' not found
# Check the characteristics of our matrix
head(wyler_readcounts.mx)
## Error in head(wyler_readcounts.mx): object 'wyler_readcounts.mx' not found
str(wyler_readcounts.mx)
## Error in str(wyler_readcounts.mx): object 'wyler_readcounts.mx' not found
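The if_all() band filter used above can be sketched on a toy data frame (hypothetical genes, counts, and cutoffs):

```r
library(dplyr)

# Toy read counts: a gene column plus three sample columns
toy_counts.df <- tibble(gene = c("g1", "g2", "g3", "g4"),
                        s1   = c(5, 150, 900, 40000),
                        s2   = c(8, 220, 850, 35000),
                        s3   = c(2, 180, 700, 50000))

# Keep only genes whose counts fall inside the band in *every* sample column
kept.df <- toy_counts.df %>%
  filter(if_all(.cols = -1, .fns = ~ .x > 100 & .x < 1000))

kept.df$gene   # g1 is too low everywhere, g4 too high
```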
What if we were to produce a heatmap directly with the raw data? It would be impossible to read, and computing a dendrogram would take a very long time given the 27,011 rows of data across the 36 datasets. That’s why we’ll use our small subset of data as an example to build a heatmap.
Let’s plot the data now with Heatmap and see how it
looks with a dendrogram.
# Create a Heatmap object
wyler_hmap <- Heatmap(..., # Supply our matrix
cluster_rows = TRUE, cluster_columns = TRUE, # Cluster on both rows and columns
col = ...,
# Use column_title as the title of our heatmap
column_title = "Heatmap of Wyler et al., 2020 readcount data",
# Rotate the legend horizontally and give it a title
heatmap_legend_param = list(title = "readcounts per gene",
legend_direction = "horizontal"),
)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Plot the heatmap
ComplexHeatmap::draw(wyler_hmap,
# Plot the legend on the bottom
heatmap_legend_side = "bottom"
)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'draw': object 'wyler_hmap' not found
From the above output, we can see that our heatmap is already quite hard to read, and that’s with just 109 genes! We do, however, get some strong trends regarding our datasets - we can see certain sets of replicates are grouped together in the data, but not all of them. This might be clearer if we were to factor in all of the datapoints, or just the relevant genes that define the differences between datasets. We’ll discuss the second part of that thought later in this lecture.
Recall that what we’re really interested in, regarding these readcount data, is to ask just how similar the experiments are between each other. Whereas we were a little limited by the space needed to produce the scatterplot matrix, we were able to produce correlation values between experiments on a small scale. We can extend that visualization forward to compare between all datasets and then visually summarize the data as a heatmap.
The portion of the scatterplot matrix we’ll extend is the correlation values. Whereas the ggpairs() function defaults to a Pearson correlation, here we can choose between a Pearson or Spearman correlation. Sometimes you may wish to compare both, since Pearson examines linear relationships whereas Spearman uses a ranking method to compare variables for monotonic relationships.
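To make the distinction concrete, here is a minimal sketch (toy vectors, not our readcount data): a monotonic but non-linear relationship scores a perfect Spearman correlation while Pearson reports something lower.

```r
# Toy data: y increases monotonically with x, but not linearly
x <- 1:10
y <- x^3

# Pearson measures the strength of a linear relationship, so the
# curvature drags the value below 1
cor(x, y, method = "pearson")

# Spearman works on ranks, so any perfectly monotonic trend returns exactly 1
cor(x, y, method = "spearman")
```

If the two methods disagree strongly on your data, that’s a hint the relationship is monotonic but not linear.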
To generate our matrix we’ll employ a couple of additional functions:
expand.grid(): we can use this to build a matrix of
pair-wise combination values. We’ll use these to generate the pair-wise
column combinations we want to compare with cor(). Given
two or more vectors, this function will produce a table of all possible
combinations between the vector elements supplied.
apply(): by now you should be familiar with this
function. We’ll use it to iterate through our pair-wise column
combinations and send each of those to…
cor(): this will return the correlation coefficient
based on the method we choose (pearson, kendall,
spearman)
# Example of how you can use expand.grid to make pairwise combination between two sets of data.
expand.grid(c(...),c(...))
## Error in expand.grid(c(...), c(...)): '...' used in an incorrect context
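The cell above is left for you to fill in, so here is a filled-in sketch using two hypothetical sample-name vectors, just to show the shape of the output:

```r
# All pairwise combinations of the two vectors: 2 x 2 = 4 rows,
# with the first vector cycling fastest
expand.grid(c("d4_SARS", "d4_mock"), c("d8_SARS", "d8_mock"))
## row 1: d4_SARS d8_SARS
## row 2: d4_mock d8_SARS
## row 3: d4_SARS d8_mock
## row 4: d4_mock d8_mock
```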
# Take a quick look at our dataset again
head(wyler_readcounts.df)
## Error in head(wyler_readcounts.df): object 'wyler_readcounts.df' not found
# First we need to fix the column names from our data
wyler_readcounts_only.df <-
# Use the FULL set of readcount data
wyler_readcounts.df %>%
# Rename the columns by removing the first portion: AECII_xx
rename_with(., ~ str_replace(string = .x,
pattern = r"(\w*_\d*_)",
replacement = "")) %>%
# Drop the first two columns as well
dplyr::select(c(3:38))
## Error in rename_with(., ~str_replace(string = .x, pattern = "\\w*_\\d*_", : object 'wyler_readcounts.df' not found
# Create our correlation matrix
wyler_cor.mx <-
# We'll apply a function to each row of the resulting grid pairs
matrix(apply(expand.grid(c(...),c(...)), # generate the pair-wise combinations
# explore it row-by-row
MARGIN = 1,
# Apply a function to each row which generates a pearson correlation between
# a specific pair of columns
function(x) cor(wyler_readcounts_only.df[, x[1]],
wyler_readcounts_only.df[, x[2]],
method = "pearson") # Use a Pearson correlation
), # End the apply function
nrow = 36, # Cast our result as a matrix
dimnames = list(colnames(wyler_readcounts_only.df), # name the rows and columns
colnames(wyler_readcounts_only.df))
)
## Error in expand.grid(c(...), c(...)): '...' used in an incorrect context
# take a look at the final matrix
head(wyler_cor.mx)
## Error in head(wyler_cor.mx): object 'wyler_cor.mx' not found
Now that we’ve completed the correlation matrix and it appears to be correct, we can generate the heatmap of the data, allowing it to group data based on the Pearson correlation values.
# Create a Heatmap object
wyler_hmap <-
Heatmap(wyler_cor.mx, # Supply our matrix
# Cluster on both rows and columns
cluster_rows = TRUE, cluster_columns = TRUE,
col = viridis(100),
# Use column_title as the title of our heatmap
column_title = "Heatmap of RNA-Seq Pearson correlation on readcounts",
# Rotate the legend horizontally and give it a title
heatmap_legend_param = list(title = "Pearson score",
legend_direction = "horizontal"),
# Set the row/column label font size
row_names_gp = gpar(fontsize = 16),
column_names_gp = gpar(fontsize = 16)
)
## Error in is.data.frame(matrix): object 'wyler_cor.mx' not found
# Plot the heatmap
ComplexHeatmap::draw(wyler_hmap,
# Plot the legend on the bottom
heatmap_legend_side = "bottom"
)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'draw': object 'wyler_hmap' not found
hclust()

As you can see, clustering our data (on the right metric!) makes a
big difference in how our data is displayed. In our last few heatmaps,
we allowed Heatmap() to generate its own clustering. We
could have supplied it with a specific function to calculate distances,
or even a dendrogram object to order the data as well. Under the hood,
the function defaults to euclidean distances and passes
that information on to the hclust() function to produce the
dendrograms.
The choice of method determines how the data is
clustered. The choices include: ward.D, ward.D2, single, complete
(default), average, mcquitty, median and centroid.
In general ward.D, ward.D2
and complete strategies try to find compact small
“spherical” clusters. The single linkage method adopts
a ‘friends of friends’ clustering strategy. The other methods can be
said to aim for somewhere in between. For more information on these, you
can dig into the hclust() documentation.
We can turn to hclust() to generate just dendrograms for
us but prior to that we still need to reformat our matrix a little using
the dist() function.
dist()

Remember we talked about our data being in n-dimensional space? Using those coordinates, there are a number of ways to calculate the distance between points. The simplest form is to calculate the euclidean distance between points using the generic formula:
\[d(p,q) = \sqrt{(p_{1}-q_{1})^2 + (p_{2}-q_{2})^2 + \ldots + (p_{n}-q_{n})^2}\]
but there are a number of other options as well. For instance you can
also choose the maximum distance between two components (same
dimension). The dist() function can generate these values
for you using the parameter method to determine how the
distance will be used. Your options are: euclidean, maximum, manhattan,
canberra, binary or minkowski.
The dist() function will return a dist
object which is a lower triangle distance matrix between all of the
rows/observations using the columns as coordinates.
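A minimal sketch with two made-up points forming a 3-4-5 triangle, so the results are easy to verify by hand:

```r
# Two points in 2-D space: (0, 0) and (3, 4)
pts <- matrix(c(0, 0,
                3, 4),
              nrow = 2, byrow = TRUE)

# Euclidean: sqrt((3-0)^2 + (4-0)^2) = 5
dist(pts, method = "euclidean")

# Manhattan: |3-0| + |4-0| = 7
dist(pts, method = "manhattan")

# Maximum: the largest single-dimension difference, max(3, 4) = 4
dist(pts, method = "maximum")
```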
Let’s return to our PHU data in
covid_demographics_norm.mx to try it out and see how it
works.
# Generate our distance matrix
dist(covid_demographics_norm.mx[,-1],
method="euclidean") %>% # The default method is euclidean
# Show the structure of our output
str()
## Error in as.matrix(x): object 'covid_demographics_norm.mx' not found
Now that we can see how a distance matrix is made, we can cluster it
using the hclust() function.
Parameters that are important:
method: already described as the method which we
want to use to cluster our data.
k: the number of groups we would like our final data
to be split into, represented by colours (this one is used below when
plotting with fviz_dend(), not by hclust() itself).
# Make a cluster object and take a look at it
phu.hc <- hclust(dist(covid_demographics_norm.mx[,-1]),
method = ...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# What does our hclust object contain?
str(phu.hc)
## Error in str(phu.hc): object 'phu.hc' not found
# What does our merge matrix look like?
head(phu.hc$merge)
## Error in head(phu.hc$merge): object 'phu.hc' not found
fviz_dend() from the package factoextra

Now that we have generated our clustering object, you can see that it
holds a number of features including the height (length) of our
branches, an ordering of our PHUs, and the order in which they merge. The
merge process is described in more detail in the
hclust() documentation.
For instance, our merge matrix indicates that label 13 (Leeds, Grenville
& Lanark District) is joined to label 20 (Huron Perth).
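The sign convention in the merge matrix is easiest to see on a tiny made-up example: negative entries are original (singleton) observations, while positive entries point to the cluster created at that earlier row of the matrix.

```r
# Three 1-D observations: two close together (1, 2) and one far away (10)
toy.hc <- hclust(dist(c(1, 2, 10)), method = "complete")

# Row 1 merges singletons 1 and 2 (both negative)
# Row 2 merges singleton 3 with the cluster built at row 1 (positive 1)
toy.hc$merge
```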
To simplify the plotting process, we can use the
fviz_dend() function which will parse through the
hclust object to produce a proper dendrogram. We can even
specify some grouping or colouring schemes based on how many groups,
k, we believe are present in our data.
We can treat the resulting plot like a ggplot to update particular elements as well.
# Plot out our dendrogram with fviz_dend
fviz_dend(..., # provide our hclust object
cex = 1.5, # set the text size to be larger
k = ..., # How many clusters do we expect in our data
palette = "Set1", # what colours will we use. Many options including RBrewer palettes
horiz = TRUE,
labels_track_height = 12000,
) +
# Bump up the text size for my old eyes
theme(text = element_text(size = 20))
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
So we’ve visualized our dataset as a dendrogram, and it gives us an idea of how the samples might be related (in n-dimension space) based on the case values we produced from normalization. There was a crucial step in our last visualization, however, where we had to guess how many clusters (groups) our data would divide into. This was used to colour parts of our tree green, blue, and red, but it was a bit of a guess. On a deeper level, what goes unanswered by a simple glance at the figure is how the function decided where each group would begin and end. If we had picked k = 4 or 5, it would become more evident that the groups are broken by the most ancestral internal nodes/bifurcations in our dendrogram.
Under the hood of our process, k-means clustering implements
unsupervised learning for uncategorized data. We don’t know the
true labels of our data and are more interested in seeing which observations
group together. How many clusters should we aim to generate? We’ve
already seen that our data likely splits into 3 groups based on the
heatmaps and hclust() data. Will we get the same thing with
a k-means method?
When in doubt, you can quickly consult the
factoextra::fviz_nbclust() function, which can estimate how
many clusters are ideal for representing your data. Note that it may not
produce your ideal number of clusters, but if you’re stuck, this is a
good start.
There are three method options: wss, gap_stat, and
silhouette, which each estimate the optimal number of clusters.
wss elbow method aims to minimize intracluster
variation. How compact is our clustering? A lower WSS is
better.
silhouette measures how well an object lies within
its cluster by computing its average distance to other members of its
own cluster, \(a(i)\), vs the nearest other cluster, \(b(i)\). We evaluate \(\frac{b(i) - a(i)}{\max\{a(i), b(i)\}}\), with
higher values indicating better clustering. In this case, the location
of the maximum value is considered the optimal cluster number.
gap_stat also compares total intra-cluster variation
but against a null reference distribution. There are a few potential
ways to interpret this for optimal value. The first, is simply to
maximize for the best gap stat value but this can be problematic if our
output shows increasing gap statistic values with increasing k. On a
similar vein, you can look for the biggest jumps/gains in gap statistic
which suggests a much more informative clustering than the previous k
value. Lastly, a 1-standard-error method has been proposed, which
chooses the smallest k whose gap statistic is within one
standard error of the gap at k+1.
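The silhouette formula above can also be computed by hand for a single point. A base-R sketch on four made-up 1-D values split into two obvious clusters:

```r
# Four 1-D values in two obvious clusters: {1, 2} and {8, 9}
vals     <- c(1, 2, 8, 9)
clusters <- c(1, 1, 2, 2)

# Silhouette for the first point (value 1, in cluster 1):
a_i <- mean(abs(vals[1] - vals[clusters == 1][-1]))  # avg distance within its own cluster: 1
b_i <- mean(abs(vals[1] - vals[clusters == 2]))      # avg distance to the other cluster: 7.5
s_i <- (b_i - a_i) / max(a_i, b_i)

s_i  # (7.5 - 1) / 7.5 = 0.8667, i.e. the point sits comfortably in its cluster
```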
# How many clusters should we generate?
fviz_nbclust(x = covid_demographics_norm.mx[,-1], # Provide the dataset
FUNcluster = kmeans, # How will you cluster the data?
method=...) + # What method will you use?
theme(text = element_text(size=10))
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# How many clusters should we generate?
fviz_nbclust(x = covid_demographics_norm.mx[,-1], # Provide the dataset
FUNcluster = kmeans, # How will you cluster the data?
method="silhouette") + # What method will you use?
theme(text = element_text(size=10))
## Error in fviz_nbclust(x = covid_demographics_norm.mx[, -1], FUNcluster = kmeans, : object 'covid_demographics_norm.mx' not found
# How many clusters should we generate?
fviz_nbclust(x = covid_demographics_norm.mx[,-1], # Provide the dataset
FUNcluster = kmeans, # How will you cluster the data?
method="gap_stat") + # What method will you use?
theme(text = element_text(size=10))
## Error in fviz_nbclust(x = covid_demographics_norm.mx[, -1], FUNcluster = kmeans, : object 'covid_demographics_norm.mx' not found
kmeans()

So from three different analyses we didn’t really get a consensus on the best number of clusters:
wss: Our elbow levels out at around k = 3 or k = 4
silhouette: Our biggest jump is at k=2 which is also our biggest value
gap_stat: The biggest jump or local maximum in within-cluster distance is at k=3 although we see another jump again at k=5. However, the selected cluster size of 1 on the figure actually suggests the data isn’t appropriate for clustering!
Overall, it looks like the answer ranges between 2 and 5 clusters.
Note that if your data consistently suggest k = 1, then you shouldn’t
cluster at all! Since our data already suggests 3 clusters, let’s start
with that. We’ll use the kmeans() function to accomplish
this.
Similar in idea to hclust(), the kmeans()
algorithm attempts to generate k-partitioned groups from the data
supplied with an emphasis on minimizing the sum of squares from points
to the assigned cluster centers. This differs from hclust()
which finishes building the entire relationship via dendrogram without
actually choosing “clusters”.
We’re going to also use set.seed() for our random number
generation. We tend to think of things as random, but computationally,
randomness is built on algorithms. Therefore we can “recreate” our
randomness if we use a pre-determined starting point also known as the
“seed”.
The parameters we’re interested in using with kmeans()
are:
x the numeric matrix of our data
centers determines the number of clusters we want to
produce
nstart defines how many randomly chosen sets of k
centres you’ll use to start the analysis. Choosing multiple different
sets allows the algorithm to avoid local minima.
iter.max is the max number of iterations allowed
while trying to converge on the best minimum metric
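A minimal sketch of these parameters on made-up 2-D data (two well-separated blobs) shows what a kmeans object contains:

```r
set.seed(42)  # reproducible random starts

# Two blobs of 10 points each, centred at (0, 0) and (5, 5)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))

toy.km <- kmeans(toy,
                 centers  = 2,    # ask for two clusters
                 nstart   = 25,   # try 25 random starting sets to dodge local minima
                 iter.max = 500)  # plenty of iterations to converge

toy.km$size     # how many points landed in each cluster
toy.km$centers  # 2 x 2 matrix: one row per cluster centre
```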
# Compute k-means with k = 3
# Set a seed for reproducibility.
set.seed(123)
# Generate our k-means analysis
phu_cases.km <-
kmeans(..., # We'll scale our data for this (more on that later too!)
centers = 3,
nstart = 25,
iter.max = 500)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# What is the structure of our kmeans object?
str(phu_cases.km)
## Error in str(phu_cases.km): object 'phu_cases.km' not found
# K-means clusters showing the group of each individuals
phu_cases.km$...
## Error in eval(expr, envir, enclos): object 'phu_cases.km' not found
fviz_cluster()

Notice the structure of this output? It includes cluster,
which denotes which of the 3 chosen clusters each row belongs to. The
centre of each cluster is defined by centers, which in this
case is a 3x7 array where each row uses 7 values to define its
position within our 7-dimension dataset. We can see each cluster’s
size is 18, 11, and 5 PHUs respectively. Notice, however,
that none of the original coordinates (data values) for our data are
retained in this object.
Rather than use ggplot directly, we’ll use the
ggplot-compatible fviz_cluster() function to help us
transform the values of our points from a 7-dimension coordinate system
to a 2-dimension visual format suitable for our simpler brains. Under
the hood fviz_cluster() is performing a principal component
analysis to group our data along the first two principal
components (more on what that means later too!).
The parameters we should be concerned with in this function include:
object the partitioning object for our data. In the
case of k-means, it is our kmeans object.
data is the original dataset we used to generate the
data. This will be necessary to plot the other data points.
ellipse.type determines how we’ll outline our
clusters. This comes with a number of options including:
"norm" or "t": assumes a multivariate (normal or t) distribution and
draws the ellipse at the confidence level given by
ellipse.level, whose default is 0.95.
"euclid": draws circles, using ellipse.level
to set their radius.
"confidence": draws a confidence ellipse at ellipse.level around
the centre of each cluster.
"convex": draws the convex hull around each cluster's points.
Let’s see how our data has been partitioned by the k-means clustering.
# Plot the cluster
# You can treat this object like a ggplot object afterwards
phu_kmeans.plot <-
fviz_cluster(object = ..., # our k-means object
data = ..., # Our original data needed for PCA to visualize
ellipse.type = "convex",
ggtheme = theme_bw(),
repel=TRUE, # Try to avoid overlapping text
labelsize = 20,
pointsize = 4,
main = "K-means clustering of PHU by normalized case number"
) +
# Set some ggplot theme information
theme(text = element_text(size=20)) +
# Set the colour and fill scheme to viridis
scale_fill_viridis_d() +
scale_colour_viridis_d()
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Print the plot!
phu_kmeans.plot
## Error in eval(expr, envir, enclos): object 'phu_kmeans.plot' not found
Let’s return to the Wyler et al., 2020 readcount data and take a look at it through a different lens. Whereas before we had generated a heatmap of the data based on looking at genes expressed within a readcount range, we’ll now take a different approach.
Let’s look at the top 500 variable genes in the dataset to help cluster them. To accomplish that we’ll want to generate the standard deviation in read count for each row (gene). We’ll utilize a few verbs we haven’t used before in this class:
rowwise(): in the same class of functions as
group_by(), this will subtly alter the tibble so that when
we perform math operations on it, these will be completed on a
row-by-row basis.
c_across(): this helper version of combine
(c()) combines values across columns within our
tibble.
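For comparison, the same row-wise standard deviation can be sketched in base R with apply() on a hypothetical mini readcount table; the rowwise()/c_across() verbs above achieve this without leaving the pipe:

```r
# A tiny made-up readcount table: one gene column, three sample columns
toy.df <- data.frame(gene = c("geneA", "geneB"),
                     s1   = c(10, 100),
                     s2   = c(12, 500),
                     s3   = c(11, 900))

# Standard deviation of each row across the numeric columns only
toy.df$stdev <- apply(toy.df[, -1], MARGIN = 1, FUN = sd)

# geneB (sd = 400) is far more variable than geneA (sd = 1)
toy.df[order(-toy.df$stdev), ]
```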
# Review the wyler readcount data
head(wyler_readcounts.df)
## Error in head(wyler_readcounts.df): object 'wyler_readcounts.df' not found
# Save our results into a new dataframe
wyler_readcounts_filtered.df <-
# Start with the original read count data
wyler_readcounts.df %>%
# Rename the columns by removing the first portion: AECII_xx
rename_with(., ~ str_replace(string = .x,
pattern = r"(\w*_\d*_)",
replacement = "")) %>%
# filter out the low readcount data. Low values will create wild variance easily
filter(if_all(.cols = -1, .fns = ~ .x > 10)) %>%
# Prepare to do row-wise calculations
... %>%
# Calculate the standard deviation across rows
mutate(stdev = sd(...)) %>%
# Ungroup and sort the data by descending value
ungroup() %>%
arrange(desc(stdev)) %>%
# Take the top 500 most variable genes
dplyr::slice(1:500)
## Error in ...(.): could not find function "..."
# Take a quick look at the start of your data
head(wyler_readcounts_filtered.df)
## Error in head(wyler_readcounts_filtered.df): object 'wyler_readcounts_filtered.df' not found
Now we’ll just convert our dataframe into a matrix of values so we can perform our various analyses.
# Save just the RNA-Seq data into a matrix
wyler_readcounts.mx <- as.matrix(wyler_readcounts_filtered.df[, 3:38])
## Error in as.matrix(wyler_readcounts_filtered.df[, 3:38]): object 'wyler_readcounts_filtered.df' not found
# Set the row names using the information from the data frame
rownames(wyler_readcounts.mx) <- wyler_readcounts_filtered.df$gene
## Error in eval(expr, envir, enclos): object 'wyler_readcounts_filtered.df' not found
# Take a quick look at the resulting matrix and its properties
head(wyler_readcounts.mx)
## Error in head(wyler_readcounts.mx): object 'wyler_readcounts.mx' not found
str(wyler_readcounts.mx)
## Error in str(wyler_readcounts.mx): object 'wyler_readcounts.mx' not found
Since we’re here, let’s take a quick step back and look at the heatmap from our new dataset. Does it reveal anything to us now that it is more nuanced?
# Create a Heatmap object
wyler_hmap <-
Heatmap(..., # Supply our matrix
cluster_rows = TRUE, cluster_columns = TRUE, # Cluster on both rows and columns
col = viridis(100),
# Use column_title as the title of our heatmap
column_title = "Heatmap of RNA-Seq readcounts on top 500 most variable genes",
# Rotate the legend horizontally and give it a title
heatmap_legend_param = list(title = "readcounts per gene",
legend_direction = "vertical"),
# Remove the row names
show_row_names = FALSE,
# Set the dendrogram sizes
... = unit(40, "mm"),
... = unit(20, "mm")
)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Plot the heatmap
ComplexHeatmap::draw(wyler_hmap,
# Plot the legend on the bottom
heatmap_legend_side = "left"
)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'draw': object 'wyler_hmap' not found
Looks like the heatmap still doesn’t reveal a lot of information, like how samples might be grouped. Let’s proceed with k-means and see if that works better. We just need to repeat our steps on a different set of data: now that we have our most variable genes and their readcounts, we’ll see whether we can cluster this information.
# How many clusters should we generate?
fviz_nbclust(t(wyler_readcounts.mx), FUNcluster = kmeans, method="wss")
## Error in t(wyler_readcounts.mx): object 'wyler_readcounts.mx' not found
fviz_nbclust(t(wyler_readcounts.mx), FUNcluster = kmeans, method="silhouette")
## Error in t(wyler_readcounts.mx): object 'wyler_readcounts.mx' not found
fviz_nbclust(t(wyler_readcounts.mx), FUNcluster = kmeans, method="gap_stat")
## Error in t(wyler_readcounts.mx): object 'wyler_readcounts.mx' not found
From the 3 methods we see that
Our WSS method produced an elbow around k = 4
Our silhouette method peaks at k = 4
Our gap_stat method sees diminishing progress also at k = 4
Let’s proceed with that in mind!
# Compute k-means with k = 4
# Set a seed for reproducibility.
set.seed(123)
# Generate our k-means analysis
wyler_readcounts.km <-
kmeans(scale(...), # We'll scale our data for this (more on that later too!)
centers = 4,
nstart = 25,
iter.max = 500)
## Error in as.matrix(x): '...' used in an incorrect context
# Take a look at the resulting k-means object
str(wyler_readcounts.km)
## Error in str(wyler_readcounts.km): object 'wyler_readcounts.km' not found
# Plot the cluster
# You can treat this object like a ggplot object afterwards
wyler_kmeans.plot <-
fviz_cluster(object = ..., # our k-means object
data = t(wyler_readcounts.mx), # Our original data needed for PCA to visualize
ellipse.type = "convex",
ggtheme = theme_bw(),
repel=TRUE, # Try to avoid overlapping text
labelsize = 20,
pointsize = 4,
main = "K-means clustering of filtered Wyler readcount data"
) +
# Set some ggplot theme information
theme(text = element_text(size=20)) +
# Set the colour and fill scheme to viridis
scale_fill_viridis_d() +
scale_colour_viridis_d()
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Print the plot!
wyler_kmeans.plot
## Error in eval(expr, envir, enclos): object 'wyler_kmeans.plot' not found
Question: This is 4-dimension data plotted on a two-dimension plane. What are Dim1 and Dim2? We’ll get to that answer soon!
Looking at the result, what’s nice about this kind of output is that we can immediately see the separation or clustering of our data. It looks like 4 was the way to go as there are 4 tight groupings from our data. It might be possible to further subdivide the group in the top left of our figure but it’s unlikely.
Based on our selection of genes, we observe:
24-, 48-, and 72-hour samples mock-treated (DMSO) and mock-infected, cluster together with 24-hour mock-treated samples infected with SARS-CoV-2.
48- and 72-hour samples mock-treated (DMSO) and infected by SARS-CoV-2 appear to share a similar profile.
24-hour samples treated with 200 nM 17AAG (HSP90 inhibitor) cluster together whether or not they are infected with SARS-CoV-2.
48- and 72-hour samples treated with 200 nM 17AAG cluster together whether or not they are infected with SARS-CoV-2.
These groupings suggest that perhaps the 24-hour SARS-CoV-2-infected timepoint is similar to uninfected controls. Meanwhile, the 48- and 72-hour SARS-CoV-2-infected samples are similar to each other but likely show very distinct transcriptional changes due to infection. Treatment with the HSP90 inhibitor, however, appears to sufficiently alter the expression profiles of infected and mock-infected cells to make them cluster together. This would be a good starting point to begin digging further into the data.
Skeptic or believer? While our above analysis seems to make sense to us, it lacks a deeper understanding of what might be happening. For one thing, we haven’t even looked closely at the genes used to create our clustering. How many of them are relevant to the infection process? Is the clustering due to off-target effects of the HSP90 inhibitor (17AAG)? While clustering is an effective way to quickly identify if subgroups exist within your data, it’s important to follow up these findings with in-depth analyses of the data.
Yes the data clusters, but why does it cluster?
One problem we encountered in our above analysis of RNA-Seq data is that there were simply too many dimensions! With ~27k gene entries in our dataframe, using clustering to analyse the entire dataset would be memory-intensive if not impossible altogether. To circumvent this problem we attempted to subset our data in two ways - filtering by read counts, and comparing variance across datasets. These approaches were a form of dimensionality reduction - namely feature elimination. Our choices, however, may have inadvertently been throwing away data that could prove insightful! This is where other dimensionality reduction methods can help guide our analyses.
You can achieve dimensionality reduction in three general ways:
Feature elimination: you can reduce the feature space by removing unhelpful features. No information is gained but you trim down your dataset size.
Feature selection: on the reverse side you can simply choose the most important features to you. How will you decide? Some schemes include ranking your features by importance but this may suffer from information loss if incorrect features are chosen.
Feature extraction: create new independent features which are a combination of the old features! Attempt to extract the essential features of your data in a more information-dense manner.
| Technique | Description | Type | Useful for |
|---|---|---|---|
| Random forest | Popular machine learning algorithm for classification randomly chooses from subsets of features to classify data. The most frequently appearing features amongst the forests are chosen. | Feature selection | Only takes numeric inputs |
| PCA | Principal component analysis attempts to maximize variation when transforming to a lower-dimensional space. All new features are independent of one another. | Linear Feature Extraction | Actually really good at finding outliers in RNAseq data |
| PCoA | Principal Coordinate analysis uses distance matrices rather than raw feature values. These distance matrices represent the pairwise dissimilarities between objects but they are very dependent upon the criteria/methods used to determine similarity. | Linear Feature extraction | Microbial community composition analysis |
| t-SNE | t-Distributed Stochastic Neighbour Embedding is a non-linear technique, similar in aim to PCA, suited to high-dimension datasets. It converts nearest-neighbour relationships into probabilities and minimizes the mismatch between those probabilities when transforming from high- to low-dimension spaces | Non-linear Feature extraction | Determine if your data has underlying structure |
| UMAP | Uniform manifold approximation and projection is projection-based like t-SNE; it attempts to preserve local data structure but has improved translation of global data structure. | Non-linear feature extraction | Faster than t-SNE |
Since we’re not exactly building a classifier but rather trying to find trends in our data, we won’t be looking at Random Forests here. We are interested in exploratory data analysis: we have a data set we want to understand, and it is often too complex to directly discern 1) the underlying structure and 2) the features that drive that structure.
There’s more to explore with dimension reduction: The above table is just a small subset of potentially different kinds of dimension reduction methods. PCA for instance has 2 “variants”: Multiple Correspondence Analysis (MCA) which deals with relationships between categorical variables rather than continuous ones, and Independent Component Analysis (ICA) which also uses linear dimension reduction to identify independent components in your dataset. You can check out a little more here and over here
Going back to our PHU data, we used 7 dimensions to classify our data but what if we wanted to transform that information in some way to reduce the number of dimensions needed to represent their relationships? For our data set, 7 dimensions isn’t a lot of features but in other cases (like RNA-Seq) you might encounter feature-dense datasets and without knowing a priori which ones are meaningful, PCA provides a path forward in classifying our data.
Now there are caveats to PCA. It performs linear feature extraction by building new features in your n-dimensional space such that each new feature (kind of like a projected line) captures the maximum variance of the original data points. Each new component must be uncorrelated with (i.e. perpendicular to) the previous ones while accounting for the next highest amount of variance. More resources on how this works are in the references.
How do we go about maximizing the variance (spread of red dots) of our data features? Finding the projection where variance is maximized also minimizes error (red line lengths). Generated by user Amoeba on stackexchange.com.
All math aside, our goal is to reduce our feature set to something smaller by trying to represent our data with these new features. Just remember that highly variable data and outliers can dominate your principal components.
For simplicity let’s head back and look at our PHU age group data again. To illustrate our example with PCA, let’s use the original data normalized by PHU population.
# View the normalized PHU age group data
head(covid_demographics_norm.mx)
## Error in head(covid_demographics_norm.mx): object 'covid_demographics_norm.mx' not found
PCA() to generate our analysis

To generate our analysis of the PHU data, we’ll use the
FactoMineR function PCA(), for which there are
some parameters we’ll be using that we should discuss:
X a data frame of n rows (observations) and
p columns (numeric variables)
ncp the number of dimensions kept in the results (5
is the default)
scale.unit a boolean to scale your data (TRUE by
default)
Remember that PCA is trying to maximize the variance of the observations projected onto each principal component. Depending on the nature of your variables you may have, for instance, two different unit types like height and mass. Smaller changes in height may be matched with much larger changes in mass, or just wider overall variance. This may lead the PCA algorithm to prioritize mass over height when you’d prefer they have equal importance. By centering your data on the mean and scaling to unit variance, everything is compared as a z-score, bringing the overall variance of each variable to within ~3 standard deviations.
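A quick sketch of that scaling on a hypothetical height/mass pair: after scale(), every column has mean 0 and standard deviation 1, so neither variable dominates by virtue of its units.

```r
# Two variables on very different scales (made-up values)
hm <- cbind(height_m = c(1.60, 1.75, 1.90),
            mass_kg  = c(55, 70, 95))

# Centre each column on its mean, then divide by its standard deviation
hm.scaled <- scale(hm)

colMeans(hm.scaled)      # both columns now ~0
apply(hm.scaled, 2, sd)  # both columns now exactly 1
```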
Let’s compare PCA with and without scaling shall we?
# Build a PCA of our PHU data with scaling applied
phu_scaled.pca <- PCA(...,
scale.unit = ..., # What happens when we don't scale the data?
ncp = ...,
graph = TRUE)
## Error in PCA(..., scale.unit = ..., ncp = ..., graph = TRUE): could not find function "PCA"
# Build a PCA of our PHU data WITHOUT scaling applied
phu_unscaled.pca <- PCA(covid_demographics_norm.mx[,-1],
scale.unit = ...,
ncp = 7,
graph = TRUE)
## Error in PCA(covid_demographics_norm.mx[, -1], scale.unit = ..., ncp = 7, : could not find function "PCA"
# Take a look at the information inside our PCA object
print(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
Regardless of scaling, the result of our PCA() call
produces an object with many variables we can access. Above you can see
a brief description for each variable but we are most interested in a
few particular ones:
eig holds our dimensional data and also tells us just how much of the overall variation of our data each new principal component explains.
var holds the results of all the variables. We can
use these to graph and visualize our data.
ind$coord will allow us to plot the coordinates of
our observations along the principal components.
The eigenvalues from our analysis pair with the eigenvectors (principal components) to help transform our data from the original feature set to the new set of features. While the eigenvectors determine the directions of the new feature space, each eigenvalue represents the magnitude of its vector and, in this case, can be used to calculate the percent of overall variance explained by that eigenvector.
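To see the eigenvalue/variance relationship concretely, here is a small base-R sketch on toy data (not the PHU dataset). It uses eigen() and prcomp(), which are outside the FactoMineR workflow used in this lesson but show the same arithmetic:

```r
# Toy data: 10 observations, 4 features
set.seed(1)
toy.mx <- matrix(rnorm(40), nrow = 10, ncol = 4)

# Eigen-decompose the correlation matrix (i.e. the scaled data)
toy.eig <- eigen(cor(toy.mx))

# Each eigenvalue divided by their sum = percent of variance explained
percent_var <- 100 * toy.eig$values / sum(toy.eig$values)
percent_var

# The same percentages come out of prcomp() with scale. = TRUE
toy.pca <- prcomp(toy.mx, scale. = TRUE)
100 * toy.pca$sdev^2 / sum(toy.pca$sdev^2)
```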
The important take-away is that we can now see just how much of our
variance is explained in each new principle component. We can access
this information directly from phu_scaled.pca or by using
the function get_eigenvalue(). We can also plot this as a
barchart instead using fviz_eig().
# Look at our eigenvalues directly
phu_scaled.pca$...
## Error in eval(expr, envir, enclos): object 'phu_scaled.pca' not found
# Use get_eigenvalue() to look at our eigenvalues
get_eigenvalue(phu_scaled.pca)
## Error in get_eig(X): object 'phu_scaled.pca' not found
See how nearly 80% of our variation is explained in just our first
dimension? Let’s use fviz_eig() to display this information
visually in what is known as a scree plot. It’s essentially a
barplot/lineplot combo but what we’re interested in is following the
lines much like our cluster-estimating WSS method. We use a “sharp
elbow” to determine how many principal components we need.
# Visualize the impact of our eigenvalues
fviz_eig(..., addlabels = TRUE) + theme(text = element_text(size=10))
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
Looking at our eigenvalues we can see that even though 7 new PCs were generated, the first two explain almost 92% of our variance. Whatever linear separation exists in our data, we can likely recreate it in a two-dimensional projection of coordinates from PC1 and PC2. In other words, we can capture the underlying structure of our data with these two new features instead of a 7-dimensional space!
This makes some sense when we take a closer look at the data. For
instance, there isn’t much visual information found in our
0 to 4 age range. The small distribution of information in
these features may not do much, in the grand scheme, to change how the
PHUs are separated from each other. Then again, perhaps they do
contain information that we simply cannot see easily! How will we
know?
get_pca_var()
Within our PCA, we can access information regarding how our original variables are transformed into the new space. This is all stored in the var element of our PCA object. We can extract aspects of this using get_pca_var() and visualize them to determine the quality of our variables and their representation by the new principal components.
# What is the information associated with our original variables
phu.var <- get_pca_var(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
phu.var
## Error in eval(expr, envir, enclos): object 'phu.var' not found
Each original variable may, to some degree, influence the amount of information captured into each new principal component. When we look at each we can see the breakdown of variables, which in some cases, can reveal to us where more or less variation is found as well.
# What is the contribution of each variable?
phu.var$...
## Error in eval(expr, envir, enclos): object 'phu.var' not found
From above we can see that our original variables contribute roughly equally to our first principal component, but there is much more contribution to PC2 from three variables: 0 to 4, 5 to 11 and 80+. Notably, these same groups had the smallest contributions to PC1. Based on our previous scree plot, that means the 12.8% of overall variation described by PC2 comes mainly from these three groups.
How many dimensions should I get and how many do I keep? It is at this point that we should discuss the maximum number of possible dimensions. Given n observations, and p features, the maximum number of dimensions after reduction is min(n-1, p). In terms of how many dimensions you keep, you should consider the general rule of thumb that at least 80% of your variation should be accounted for by the principal components that you choose. Keep that in mind!
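The 80% rule of thumb above can be applied programmatically. A minimal sketch, assuming a FactoMineR PCA object named my.pca (a hypothetical placeholder for an object like phu_scaled.pca) and the factoextra helper get_eigenvalue():

```r
# get_eigenvalue() returns eigenvalue, variance.percent and
# cumulative.variance.percent for each component
eig.mx <- get_eigenvalue(my.pca)

# Index of the first component where cumulative variance crosses 80%
n_keep <- which(eig.mx[, "cumulative.variance.percent"] >= 80)[1]
n_keep
```

This tells you the minimum number of principal components to retain under the 80% guideline.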
coord and cor
From the remaining variable information, coord and cor can both be used to plot our original variables along different pairs of principal components. These are also known as loadings and can range from [-1, 1]. These values are the same because this is not a projection of our variables onto the PCs but a correlation with them! The distance between the origin and the variable coordinates gauges the quality of the variables on the map. This number is summarized in the cos2 (squared cosine) element, which can range from [0, 1].
# Look at the coordinates of our variables across the PCs.
phu.var$...
## Error in eval(expr, envir, enclos): object 'phu.var' not found
phu.var$...
## Error in eval(expr, envir, enclos): object 'phu.var' not found
# What is the quality of our variables?
# The higher the value the better!
phu.var$...
## Error in eval(expr, envir, enclos): object 'phu.var' not found
fviz_pca_var()
To capture all of our information in a single visualization we can use fviz_pca_var() to assess our variables. The fviz_pca_var() function will plot cos2 values on a 2-dimensional axis pair of your choosing. High-quality variables represented by the two dimensions of your plot (i.e. PC1 vs PC2) will be closer to the circumference of the circle.
Some variables may require more than 2 components to represent the data and they will therefore fall inside the circle. Low quality correlations will be closer to the origin since they correlate less with the two dimensions represented on the graph.
As we can verify from the plot, most of our variables are positively correlated with PC1, and across PC1/PC2 our variables correlate highly with the components.
Two important parameters to keep in mind:
X: the PCA object that you want to plot.
axes: a numeric vector with the two dimensions you
want to examine.
# Compare how our variables contribute and correlate with PC1/PC2
fviz_pca_var(X = phu_scaled.pca,
col.var = ..., # How will we colour our data/lines
gradient.cols = c("green", "yellow", "red"),
labelsize = 6,
repel = TRUE, # make sure text doesn't overlap
axes = c(1,2) # Determine which PCs you want to graph
) +
theme(text = element_text(size=10))
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Compare how our variables contribute and correlate with PC2/PC3
fviz_pca_var(phu_scaled.pca,
col.var = ..., # How will we colour our data/lines
gradient.cols = c("green", "yellow", "red"),
labelsize = 6,
repel = TRUE, # make sure text doesn't overlap
axes = c(2,3) # Determine which PCs you want to graph
) +
theme(text = element_text(size=10))
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
fviz_pca_ind()
Now that we’ve generated our PCA, let’s see if there is any clear separation between our PHUs across the first two dimensions of new features. Much like our variables, all of the individual coordinate data can be found in the ind element of our PCA object. Within there, we’ll find the coord information, so we could directly build a ggplot with phu_scaled.pca$ind$coord BUT there’s a simpler way.
We’ll parse through the PCA object with fviz_pca_ind() to plot our data along the first two principal components. We can define pointsize and col.ind to help add some population size information to our PHUs.
# Graph our scaled PCA data.
fviz_pca_ind(...,
pointsize = covid_demographics_norm.mx[,1], # Set point size and colour by PHU population
col.ind = log10(covid_demographics_norm.mx[,1]),
repel = TRUE, # avoid overlapping text points
labelsize = 5,
axes = c(1,2)
) +
theme(text = element_text(size=10)) + # Make our text larger
scale_size(range = c(3, 10)) + # Update the point size range
scale_colour_viridis_c() # Change the colour scheme
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Graph our UNscaled PCA data.
fviz_pca_ind(...,
pointsize = covid_demographics_norm.mx[,1], # Set point size and colour by PHU population
col.ind = log10(covid_demographics_norm.mx[,1]),
repel = TRUE, # avoid overlapping text points
labelsize = 5,
axes = c(1,2)
) +
theme(text = element_text(size=10)) + # Make our text larger
scale_size(range = c(3, 10)) + # Update the point size range
scale_colour_viridis_c() # Change the colour scheme
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# What are the contributions of each age group to the various unscaled dimensions
phu_unscaled.pca$var$contrib
## Error in eval(expr, envir, enclos): object 'phu_unscaled.pca' not found
# Look at our unscaled eigenvalues directly
phu_unscaled.pca$eig
## Error in eval(expr, envir, enclos): object 'phu_unscaled.pca' not found
As you can see from our graphing of both PCA sets, when we choose not to scale our data something a little more drastic happens. PC1’s portion of variance increases to 83.2% of overall variance. We see many of our points move, and separation between some of our PHUs is increased (Kingston/Frontenac and Peel). These changes are likely due to the larger range of values from the 20 to 39 age group, which now contributes 31% of PC1’s variation.
So think carefully about your features and what they represent. Depending on their range and unit values, you may be inadvertently weighting them more heavily than your other features! Scaling adjusts your data so that all features are weighted equally.
Now that we’ve effectively reduced the complexity of our data, can we produce the same k-means clustering as before? Let’s try to cluster just on the two dimensions presented, which we have already graphed!
# How many clusters should we use based on our PCA data?
fviz_nbclust(phu_scaled.pca$ind$coord[,1:2], FUNcluster = kmeans, method="wss")
## Error in fviz_nbclust(phu_scaled.pca$ind$coord[, 1:2], FUNcluster = kmeans, : object 'phu_scaled.pca' not found
fviz_nbclust(phu_scaled.pca$ind$coord[,1:2], FUNcluster = kmeans, method="silhouette")
## Error in fviz_nbclust(phu_scaled.pca$ind$coord[, 1:2], FUNcluster = kmeans, : object 'phu_scaled.pca' not found
# Generate our k-means analysis and visualize it
set.seed(123)
phu_PCA_kmeans.plot <-
kmeans(..., centers = 3) %>%
fviz_cluster(., # our k-means object
data = ..., # Our original data needed for PCA to visualize
ellipse.type = "convex",
ggtheme = theme_bw(),
repel=TRUE, # Try to avoid overlapping text
labelsize = 20,
pointsize = 4,
main = "K-means clustering of PHU by normalized case number AFTER Principal Component Analysis"
) +
# Set some ggplot theme information
theme(text = element_text(size=20)) +
# Set the colour and fill scheme to viridis
scale_fill_viridis_d() +
scale_colour_viridis_d()
## Error in kmeans(..., centers = 3) %>% fviz_cluster(., data = ..., ellipse.type = "convex", : '...' used in an incorrect context
# phu_PCA_kmeans.plot
# Plot both of our figures together
# fig.show="hold"; out.width="50%"; out.height = "50%"
phu_kmeans.plot
## Error in eval(expr, envir, enclos): object 'phu_kmeans.plot' not found
phu_PCA_kmeans.plot
## Error in eval(expr, envir, enclos): object 'phu_PCA_kmeans.plot' not found
From our estimations, the clustering choices look nearly identical. This is not the best dataset to demonstrate with since it is rather small, but you can see that we have now transformed our data into just 2 features! Could we do the same using just 2 features from our original dataset? No! So imagine transforming a much larger and feature-heavy dataset into a smaller number of features. From the analysis, we are also able to discern which of our original features contribute most to each new principal component!
Something to consider the next time you’re working with a large data set!
Why does our k-means clustering look like a PCA
plot: If you take a close look at our original
phu_kmeans.plot, you might notice the x- and y-axis are
named Dim1(79.6%) and Dim2(12.8%). You may recall these are the exact
percentage of variance explained in PC1 and PC2 of the eigenvalues in
our principal component analysis. What’s going on here? In order to plot
our original k-means cluster analysis, we provided our data matrix which
was 34x7. In order to project that 7-dimension data onto a 2-dimensional
plane, the fviz_cluster() function also needed to
internally perform a PCA followed by picking the top 2 dimensions! Thus
the point plotting was identical BUT the clustering was slightly
different because we used seven features in
our first clustering vs two features in our
second.
Now that we’ve walked through how to generate a PCA for a simpler dataset, let’s return to our RNA-Seq readcount data. We won’t trim down the data at all but rather just provide the entire dataset as a matrix for feature reduction. Let’s review the steps:
# 1. Generate our matrix from the readcount data
wyler_readcounts_all.mx <-
wyler_readcounts.df %>%
# Rename the columns by removing the first portion: AECII_xx
rename_with(., ~ str_replace(string = .x,
pattern = r"(\w*_\d*_)",
replace = "")) %>%
# Keep only the experimental observations (how many are there?)
dplyr::select(c(3:38)) %>%
# Convert to a matrix
as.matrix() %>%
# Transpose the matrix so our columns are now genes
...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Name the rows of the matrix
colnames(wyler_readcounts_all.mx) <- wyler_readcounts.df$gene
## Error in eval(expr, envir, enclos): object 'wyler_readcounts.df' not found
# We'll peek at the structure of the matrix. Looking at it with head() will be messy.
str(wyler_readcounts_all.mx)
## Error in str(wyler_readcounts_all.mx): object 'wyler_readcounts_all.mx' not found
# 2. Generate the PCA object
readcounts_scaled.pca <- PCA(...,
scale.unit = TRUE, # What happens when we don't scale the data?
ncp = 40,
graph = TRUE)
## Error in PCA(..., scale.unit = TRUE, ncp = 40, graph = TRUE): could not find function "PCA"
# 3. Review how much variance is explained by the new components
readcounts_scaled.pca$eig
## Error in eval(expr, envir, enclos): object 'readcounts_scaled.pca' not found
# 4. How many clusters should we use based on our PCA data?
fviz_nbclust(readcounts_scaled.pca$ind$coord[,1:2], FUNcluster = kmeans, method="silhouette")
## Error in fviz_nbclust(readcounts_scaled.pca$ind$coord[, 1:2], FUNcluster = kmeans, : object 'readcounts_scaled.pca' not found
# 5. Generate our k-means analysis and visualize it
set.seed(123)
readcounts_PCA_kmeans.plot <-
# Let's go with 4 clusters in order to compare to our original version.
# This is close to the silhouette value as well.
# How many features to use? 6 ~ 67% of variation
kmeans(readcounts_scaled.pca$ind$coord[,1:6], centers = ..., ) %>%
fviz_cluster(., # our k-means object
data = readcounts_scaled.pca$ind$coord[,1:2], # Our original data needed for PCA to visualize
ellipse.type = "convex",
ggtheme = theme_bw(),
repel=TRUE, # Try to avoid overlapping text
labelsize = 20,
pointsize = 4,
main = "K-means clustering of PHU by normalized case number AFTER Principal Component Analysis"
) +
# Set some ggplot theme information
theme(text = element_text(size=10)) +
# Set the colour and fill scheme to viridis
scale_fill_viridis_d() +
scale_colour_viridis_d()
## Error in fviz_cluster(., data = readcounts_scaled.pca$ind$coord[, 1:2], : '...' used in an incorrect context
# Look at our original plot with 4 centres
wyler_kmeans.plot
## Error in eval(expr, envir, enclos): object 'wyler_kmeans.plot' not found
# Versus our PCA with 4 centres
readcounts_PCA_kmeans.plot
## Error in eval(expr, envir, enclos): object 'readcounts_PCA_kmeans.plot' not found
If we compare our two versions of the k-means clustering, raw data vs our PCA dimension-reduced data, the results are pretty similar although not identical. I’ve also added what your data would look like if you did NOT scale the data before performing PCA.
| Aspect | Raw K-means | PCA K-means | PCA K-means | PCA K-means |
|---|---|---|---|---|
| Scaled | Scale at time of k-means | Scale at time of PCA | Scale at time of PCA | Unscaled at time of PCA |
| Features used for k-means | 27K genes | Top 2 components (43% var) | Top 6 components (67% var) | Top 2 components (73% var) |
| Dim 1 | 44.6% | 26.1% | 26.1% | 50.3% |
| Dim 2 | 17.8% | 16.9% | 16.9% | 22.7% |
| Cluster 1 | 24h: all, DMSO-treated; 48h+72h: uninfected, DMSO | 24h+48h: SARS-CoV2 infection, DMSO; Some 24h+48h: uninfected, DMSO | 24h: all, DMSO-treated; 48h+72h: uninfected, DMSO | 24h: all, DMSO-treated; 48h+72h: uninfected, DMSO |
| Cluster 2 | 48h+72h: SARS-CoV2 infection, DMSO-treated | 48h: uninfected, DMSO; 72h: all, DMSO-treated | 48h+72h: SARS-CoV2 infection, DMSO-treated | 48h+72h: SARS-CoV2 infection, DMSO-treated |
| Cluster 3 | 24h: 17AAG-treated | 24h: 17AAG-treated | 24h: 17AAG-treated | 24h: 17AAG-treated |
| Cluster 4 | 48h+72h: 17AAG-treated | 48h+72h: 17AAG-treated | 48h+72h: 17AAG-treated | 48h+72h: 17AAG-treated |
There are some interesting conclusions we can draw from this comparison. Ultimately, PCA can help identify factors which heavily influence your samples, as long as you also choose enough components to represent your transformed data.
At this point, we’ve accumulated a lot of data in memory. We won’t need our previous data anymore so let’s restart R from the menu (Session>Restart R) and reload the libraries we’ll need.
# Packages to help tidy our data
library(tidyverse)
library(readxl)
library(magrittr)
# Packages for the graphical analysis section
library(viridis)
library(RColorBrewer)
# Data projection packages
library(Rtsne)
library(umap)
How do we identify trends or groups within deeply complex data in an unsupervised manner?
So we just spent most of our time trying to understand how to transform our datasets based on the large amounts of variation in our data. There are some limitations we’ve discussed, but depending on your dataset size you may be interested in finding small, local similarities rather than the large-scale variation that PCA captures.
There are two popular projections we’ll discuss today but first let’s
put together a more complex dataset based on some RNAseq data in
GSE150316_DeseqNormCounts_final.txt and its companion file
2020.07.30.20165241-supp_tables.xlsx
# Read in our RNAseq data
tissue_data.df <- ...(file = ...,
header = TRUE,
row.names = 1)
## Error in ...(file = ..., header = TRUE, row.names = 1): could not find function "..."
# Take a quick look at it
head(tissue_data.df)
## Error in head(tissue_data.df): object 'tissue_data.df' not found
dim(tissue_data.df)
## Error in eval(expr, envir, enclos): object 'tissue_data.df' not found
# Read in some additional patient data
patient_data.df <- read_excel("./data/2020.07.30.20165241-supp_tables.xlsx", sheet=2)
# Take a quick look at it
head(patient_data.df)
## # A tibble: 6 x 18
##   `Case\r\nNo` `Viral\r\nhigh\r\nvs.\r\nviral\r\nlow*` Viral\r\nload\r\n(%\r\n~1
##   <chr>        <chr>                                   <chr>
## 1 1            High                                    81.2
## 2 2            Low                                     0.5
## 3 3            Low                                     2
## 4 4            Low                                     <0.01
## 5 5            High                                    18.5
## 6 6            Low                                     0.02
## # i abbreviated name:
## #   1: `Viral\r\nload\r\n(%\r\npositive\r\ntissue\r\narea by\r\nRNA\r\nISH)`
## # i 15 more variables: `ISH\r\nhyaline\r\nmembrane\r\nreactivity` <chr>,
## #   `RNA\r\nseq\r\ncoverage` <chr>, `qRT\r\nPCR**` <chr>,
## #   `Keratin\r\ncells /\r\nmm2` <chr>, `Napsin\r\nA cells\r\n/ mm2` <chr>,
## #   `CD1 63\r\ncells/\r\nmm2` <chr>, `CD 3\r\nCells/\r\nmm2` <chr>,
## #   `CD 4\r\ncells/\r\nmm2` <chr>, `CD8\r\ncells/\r\nmm2` <chr>, ...
dim(patient_data.df)
## [1] 24 18
Before we attempt one of our non-linear projection methods, let’s try and use some of the methods we’ve spent all this time looking at. We’ll naively repeat our PCA steps on this tissue data and see if we can pull any information from it. We know there are 10 tissue groups in our data so we’ll use that as a basis for clustering.
As we can see from the above plot, a majority of our samples from various tissue types fall into a single cluster. What’s more, as we’ll see, there is some structure in these samples as multiple samples come from the same patient. Sometimes even the same tissue! These relationships may have a large effect in the overall variation we see in the data, making it hard to distinguish between even tissue types. This is why we’ll turn to non-linear projection and see if it can help us out. Before that we’ll need to do some more wrangling…
First let’s examine our patient_data.df which has 24
patient samples (aka cases) listed in total. These 24 cases relate back
to tissue_data.df which have variables representing
different combinations of case and tissue type and observations for some
59K transcripts.
At this point we want to reformat the column names in
patient_data.df a bit before selecting for just the viral
load and viral load percentage information located in the 2nd and 3rd
column. We’ll hold onto this in patient_viral_load.df for
later use.
# Create a dataframe holding just the viral_load information for each patient
patient_viral_load.df <-
patient_data.df %>%
# Retain just the first 3 columns
dplyr::select(1:3) %>%
# Rename the 2nd and 3rd columns
...(case_num = 1, viral_load = 2, viral_load_percent = 3)
## Error in ...(., case_num = 1, viral_load = 2, viral_load_percent = 3): could not find function "..."
head(patient_viral_load.df)
## Error in head(patient_viral_load.df): object 'patient_viral_load.df' not found
We now want to format tissue_data.df much in the way we
did with our PHU data. We want to convert our current data which lists
genes as observations and tissue samples as columns. Essentially, we’d
like to transpose this and we could do that but the
transposition converts everything to a matrix. In the end, we want to
work with a data frame so we can hold more information.
To reduce memory use and runtime, however,
we should trim our dataset. We aren’t really interested in looking at
59,090 transcripts - many of which may be barely expressed. Since these
are normalized counts across the dataset, we can filter out
low-expression genes to make tissue_data_filtered.df. Yes,
this would again be considered a form of feature elimination.
In general, we will follow the steps laid out in the comments of the code below:
system.time(
# Trim the tissue data down
tissue_data_filtered.df <-
tissue_data.df %>%
# Convert the row names to an actual column
rownames_to_column(var="gene") %>%
# Set up the table to perform row-wise operations
rowwise() %>%
# Calculate the mean expression of each gene across all tissue samples
mutate(mean = ...(c_across(where(is.numeric)))) %>%
# Filter for samples with low expression
filter(mean > 0.5) %>%
# Calculate overall variance in case we need to make our dataset smaller
mutate(variance = ...(c_across(where(is.numeric)))) %>%
# Arrange samples by descending variance
arrange(desc(variance)) %>%
# Remove the grouping specification
ungroup()
)
## Error in rownames_to_column(., var = "gene"): object 'tissue_data.df' not found
## Timing stopped at: 0 0 0
# Take a look at the final results
head(tissue_data_filtered.df)
## Error in head(tissue_data_filtered.df): object 'tissue_data_filtered.df' not found
# how big is our filtered data frame?
dim(tissue_data_filtered.df)
## Error in eval(expr, envir, enclos): object 'tissue_data_filtered.df' not found
Now that we’ve filtered our data down to ~29k genes, we’ll run through two more steps: transposing the data with pivot_longer() and pivot_wider(), and then attaching some sample metadata.
# You need to transpose the data.
# We can do it with dplyr to keep it as a data frame and to add some info
tissue_RNAseq.df <-
tissue_data_filtered.df %>%
# trim down the columns to drop mean/variance
dplyr::select(1:89) %>%
# pivot longer
pivot_longer(cols=c(2:89), names_to = ..., values_to = ...) %>%
# redistribute the gene names to columns
pivot_wider(names_from = ..., values_from = ...)
## Error in tissue_data_filtered.df %>% dplyr::select(1:89) %>% pivot_longer(cols = c(2:89), : '...' used in an incorrect context
# Take a quick look at the transposed result
head(tissue_RNAseq.df)
## Error in head(tissue_RNAseq.df): object 'tissue_RNAseq.df' not found
# We want to add some additional sample information before assessing the data
tissue_RNAseq.df <-
tissue_RNAseq.df %>%
# Grab just the sample names
dplyr::select(sample) %>%
# Grab information from it like case number, tissue, and tissue number
# takes the form of "caseX.tissueY" or "caseX.tissue.NYC" or "NegControlX"
# Remember that this returns a LIST of character matrices
str_match_all(., pattern=r"(case([\w]+)\.([a-z]+)([\d|\.NYC]*)|(NegControl\d))") %>%
# Bind all the matrices from all the list elements together in a single object (likely a matrix)
do.call(rbind, .) %>%
# Convert the results to a data frame
as.data.frame() %>%
# Rename the columns based on the capture groups
dplyr::rename(., sample = V1, case_num = V2, tissue = V3, tissue_num = V4, neg_num = V5) %>%
# Coalesce some of the info due to negative control samples and clean up a column
mutate(case_num = coalesce(case_num, neg_num),
tissue_num = str_replace_all(.$tissue_num, pattern = "\\.", replace = "")) %>%
# Drop the neg_num column
dplyr::select(1:4) %>%
# Join this result to the RNAseq info
full_join(., y=tissue_RNAseq.df, by=c("sample" = "sample")) %>%
# Join that result to grab viral load information
right_join(patient_viral_load.df, y=., by=c("case_num" = "case_num"))
## Error in right_join(patient_viral_load.df, y = ., by = c(case_num = "case_num")): object 'patient_viral_load.df' not found
# Look at the resulting dataframe
head(tissue_RNAseq.df)
## Error in head(tissue_RNAseq.df): object 'tissue_RNAseq.df' not found
# How many tissue types do we have?
table(tissue_RNAseq.df$tissue)
## Error in table(tissue_RNAseq.df$tissue): object 'tissue_RNAseq.df' not found
# Generate a matrix version of our data but drop the sample metadata!
tissue_RNAseq.mx <- as.matrix(tissue_RNAseq.df[, ...])
## Error in as.matrix(tissue_RNAseq.df[, ...]): object 'tissue_RNAseq.df' not found
# head(tissue_RNAseq.mx)
str(tissue_RNAseq.mx)
## Error in str(tissue_RNAseq.mx): object 'tissue_RNAseq.mx' not found
dim(tissue_RNAseq.mx)
## Error in eval(expr, envir, enclos): object 'tissue_RNAseq.mx' not found
Rtsne package
We now have a somewhat more complex dataset. We are still short on actual samples (now observations) but 88 observations and nearly 30K features isn’t so bad. A broad question we may wish to ask of such a dataset is whether there is an underlying structure to these samples - i.e. do we see grouping based on tissue type, or perhaps even sample preparation?
t-Distributed Stochastic Neighbour Embedding or t-SNE is a way for us to project our high-dimension data onto a lower dimension with the aim of preserving local similarities rather than global disparity. When looking at data points, t-SNE will attempt to preserve the local neighbouring structure. As the algorithm pores over the data, it can use different transformations for different regions as it attempts to transform everything to a lower dimension. It is considered “incredibly flexible” at finding local structure where other algorithms may not.
This flexibility is accomplished through 2 steps:
1. Convert the pairwise distances between points in the high-dimensional space into conditional probabilities (similarities between neighbours).
2. Find a low-dimensional arrangement of the points whose similarity distribution matches the original one as closely as possible (by minimizing the divergence between the two distributions).
We’ll discuss more about how this algorithm affects interpretation after seeing the results, but this is considered an exploratory data visualization tool, rather than explanatory.
To produce our t-SNE projection we’ll use the Rtsne()
function from the package of the same name. Some important parameters
are:
X is our data matrix where each row is an
observation
dims sets the number of dimensions we’d like to
project onto (default is 2).
initial_dims sets the number of dimensions that
should be retained in the initial PCA step (Default 50).
perplexity a numeric parameter that tunes between
local and global aspects of your data.
This parameter is a guess as to how many close neighbours a point may have. If you have a sense of sample types (or clusters!) ahead of time, you could try to play with this value (default is 30).
According to the algorithm this value should follow this rule: \(3 \times perplexity < nrow(X) - 1\)
pca_scale is a boolean to set if the initial PCA
step should use scaled data.
max_iter is the maximum number of iterations in the
algorithm (default is 1000).
# Rtsne prefers using a matrix for memory issues
# set a seed for reproducible results
set.seed(2025)
# Perplexity must satisfy 3 * perplexity < nrow(X) - 1;
# with our 88 samples it can go as high as 29 before Rtsne() errors out.
# We have between 1-52 samples per "group", so we'll try a small value.
tissue_tsne <- Rtsne(...,
dims=2,
perplexity=5,
verbose=TRUE,
pca_scale = TRUE,
max_iter = 1500)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# What does our t-SNE object look like?
str(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
Looking above at the result of our t-SNE analysis, we can notice a few things…
Y is an 88x2 matrix holding the coordinates for our new projection.
That’s right, there is no underlying information for mapping back to our original dataset. It’s a complete black box with no way to reverse-engineer the process. That’s because the process itself is stochastic! Whereas PCA was a deterministic process, repeatable the same way every time with the same data, that is not the case for t-SNE. That’s why, even though it can be quite powerful in identifying local similarities, t-SNE does not provide a mathematical pathway back to our original data!
Let’s extract the information, combine it with our sample information
and project it using ggplot2. We can do this with a
scatterplot since we have a set of x/y coordinates we can work with.
# Build our new data frame from the Y values
tissue_tsne.df <- data.frame(x.coord = ...,
y.coord = ...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Add in our sample information
tissue_tsne.df <- cbind(tissue_RNAseq.df[,1:6], tissue_tsne.df)
## Error in cbind(tissue_RNAseq.df[, 1:6], tissue_tsne.df): object 'tissue_RNAseq.df' not found
# Fix up the information just a little bit to remove NA viral load information
tissue_tsne.df <-
tissue_tsne.df %>%
# replace NAs with DNW (did not work)
mutate(viral_load = replace_na(viral_load, replace = "DNW"))
## Error in mutate(., viral_load = replace_na(viral_load, replace = "DNW")): object 'tissue_tsne.df' not found
head(tissue_tsne.df)
## Error in head(tissue_tsne.df): object 'tissue_tsne.df' not found
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(12, "Set3"), brewer.pal(8, "Set1"))
# 1. Data
ggplot(data = tissue_tsne.df) +
# 2. Aesthetics
aes(x = x.coord,
y = y.coord,
colour = ...) +
# Themes
theme_bw() +
theme(text = element_text(size=20)) +
# 3. Scaling
scale_colour_manual(values = combo.colours) +
# 4. Geoms
geom_text(aes(label = case_num), size = 10)
## Error in ggplot(data = tissue_tsne.df): object 'tissue_tsne.df' not found
While we don’t have a lot of samples, you can still see that we were able to cluster some of our data by tissue type without providing that classification to the algorithm! Great job, team!
We can see close clustering of tissues like placenta, heart, and bowel. Our liver samples are scattered, but perhaps a different perplexity would provide different results.
One interesting observation: regardless of tissue type, some samples cluster together by case number - namely, case numbers 1, 4, and 5 seem to share strong underlying expression profiles that connect them across tissue samples. We may also be seeing false relationships, so beware!
This projection algorithm is in the same flavour as t-SNE but has some differences, including:
What does that mean for us? Faster results, and more interpretable
results! Otherwise, the same caveats apply. The setup is slightly
easier, with few options to change if you keep the defaults. We can
access umap() from the package of the same name. You may
also alter the default settings by creating a modified copy of the
umap.defaults configuration object. More information on that here
For more tinkering, you can use the uwot package instead, where the
umap() function has more options that are easily
modified.
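As a configuration sketch (assuming the umap package is installed), customizing the defaults might look like the following; the field names come from the umap package’s documented config object:

```r
library(umap)

# Copy the package's default configuration object and tweak a few fields;
# umap() accepts the modified object through its config argument.
custom.config <- umap.defaults
custom.config$n_neighbors  <- 10    # smaller neighbourhoods: emphasize local structure
custom.config$random_state <- 2025  # fix the seed inside the config itself

# Then pass it along when generating the projection, e.g.:
# tissue_umap <- umap(tissue_RNAseq.mx, config = custom.config)
```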
# Set our seed
set.seed(2025)
# Generate our projection
tissue_umap <- ...(tissue_RNAseq.mx)
## Error in ...(tissue_RNAseq.mx): could not find function "..."
# What does the UMAP object look like?
str(tissue_umap)
## Error in str(tissue_umap): object 'tissue_umap' not found
Looking at our UMAP object tissue_umap we see it houses
the projection parameters used but also some additional variables:
- data: holds our original data matrix.
- layout: contains the projection coordinates we need for plotting the data.
- knn: a weighted k-nearest neighbour graph. This graph connects each observation to its nearest k neighbours and generates the first topological representation of the data - like an initial sketch.

You may notice again that there is no data that suggests how we arrived at this solution. There are no eigenvectors or eigenvalues to reverse the projection!
Let’s extract the layout information, combine it with
our sample information and project it using ggplot2
# Re-map our projection points with our tissue data
tissue_umap.df <- data.frame(x.coord = tissue_umap$...,
y.coord = tissue_umap$...)
## Error in data.frame(x.coord = tissue_umap$..., y.coord = tissue_umap$...): object 'tissue_umap' not found
tissue_umap.df <- cbind(tissue_RNAseq.df[,1:6], tissue_umap.df)
## Error in cbind(tissue_RNAseq.df[, 1:6], tissue_umap.df): object 'tissue_RNAseq.df' not found
tissue_umap.df <-
tissue_umap.df %>%
# replace NAs with DNW (did not work)
mutate(viral_load = replace_na(viral_load, replace = "DNW"))
## Error in mutate(., viral_load = replace_na(viral_load, replace = "DNW")): object 'tissue_umap.df' not found
head(tissue_umap.df)
## Error in head(tissue_umap.df): object 'tissue_umap.df' not found
combo.colours <- c(brewer.pal(12, "Paired"), brewer.pal(12, "Set3"), brewer.pal(8, "Set1"))
# 1. Data
ggplot(data = ...) +
# 2. Aesthetics
aes(x = x.coord,
y = y.coord,
colour = tissue) +
# Themes
theme_bw() +
theme(text = element_text(size=20)) +
# 3. Scaling
scale_colour_manual(values = combo.colours) +
# 4. Geoms
geom_text(aes(label = case_num), size = 10)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
So it looks like without much tinkering we retrieved a fairly nice result. Liver and kidney are now more closely grouped as tissues, while heart samples still generally cluster together. Bowel and jejunum appear spatially grouped, and our placenta samples remain close to each other. The case-number clustering we saw with samples 1, 4, and 5 appears less pronounced.
There does appear to be some structure among the lung samples across case numbers, so this might be an avenue to explore next to see whether there truly is a relationship between these groups.
While t-SNE and UMAP produce projections onto a 2-dimensional plane, they do not strictly cluster the data. No labels are produced, and you have no route back to understanding the relationships between closely plotted points. PCA, on the other hand, is strictly a dimension reduction tool. It does not place or assign datapoints to any groups, BUT it is useful to run on large datasets prior to clustering!
Today we took a deep dive into principal component analysis. There are of course variants of this method based on the assumptions you can make about your observations and variables, such as independent component analysis (ICA, non-Gaussian features) and multiple correspondence analysis (MCA, categorical features). Some additional methods also store their transformation the way PCA does, notably principal coordinate analysis (PCoA) and variational autoencoders (VAE).
Overall we should remember that while PCA can have problems with its feature extraction, it is deterministic and repeatable. The final results are also provided in such a way that new observations can be transformed and projected onto the same principal components. You can also feed these components back into clustering algorithms like k-means to try and identify specific subgroups.
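Because PCA stores its rotation, both of these points are easy to demonstrate in base R: predict() projects new observations onto the existing components, and the scores can feed straight into kmeans(). A small sketch with toy data:

```r
set.seed(2025)
# Toy data: 20 observations of 5 variables
train <- matrix(rnorm(100), nrow = 20)

# Fit PCA once; the rotation (and centering/scaling) is kept in the result
pca <- prcomp(train, scale. = TRUE)

# Project brand-new observations onto the SAME principal components
new_obs <- matrix(rnorm(10), nrow = 2)
scores_new <- predict(pca, newdata = new_obs)  # 2 rows of scores on the existing PCs

# Feed the first two PCs back into k-means to look for subgroups
km <- kmeans(pca$x[, 1:2], centers = 3)
km$cluster  # one group label per original observation
```

Neither t-SNE nor UMAP (in the basic workflow we used today) gives you this predict() step, which is the practical consequence of the missing eigenvectors.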
t-SNE and UMAP, on the other hand, appear to do a much better job with high-dimensional data. They can preserve local structure, and UMAP can also do a fairly good job of preserving global structure. These tools make for great exploratory analysis of your complex datasets. Interpretation of relationships, however, is not mathematically clear like it is in PCA. These are, after all, projections from a higher dimension for our simpler primate brains!
This week’s assignment will be found under the current lecture folder under the “assignment” subfolder. It will include an R markdown notebook that you will use to produce the code and answers for this week’s assignment. Please provide answers in markdown or code cells that immediately follow each question section.
| Assignment breakdown | | |
|---|---|---|
| Code | 50% | - Does it follow best practices? |
| | | - Does it make good use of available packages? |
| | | - Was the data prepared properly? |
| Answers and Output | 50% | - Is the output based on the correct dataset? |
| | | - Are the groupings appropriate? |
| | | - Are the titles/axes/legends correct? |
| | | - Is the interpretation of the graphs correct? |
Since coding styles and solutions can differ, students are encouraged to use best practices. Additional marks may be awarded for well-coded or elegant solutions.
You can save and download the markdown notebook in its native format. Submit this file to the appropriate assignment section by 12:59 pm on the date of our next class: April 17th, 2025.
Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 2.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 2.0.1: Revised and prepared for CSB1020H S LEC0141, 03-2025 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
More information on calculating optimal clusters: https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/
Step-by-step how PCA works: https://builtin.com/data-science/step-step-explanation-principal-component-analysis
More PCA explanations here: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
The math of PCA: https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
t-SNE in R and Python: https://datavizpyr.com/how-to-make-tsne-plot-in-r/
All about UMAP: https://umap-learn.readthedocs.io/en/latest/basic_usage.html
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.